This is a rewrite of my County Cluster notebook - originally written in Python in Jupyter Notebook - using R. The code may not be pretty, buy my goal is simply to practice R and demonstrate basic competence.
Loading packages, setting random seed
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ──────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(modelr)
library(maps)
Attaching package: ‘maps’
The following object is masked from ‘package:purrr’:
map
library(mapproj)
library(htmlwidgets)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
set.seed(123)
Read in datasets
ed_data <- read_csv("../data/Education.csv", locale = locale(encoding = "Latin1"))
Rows: 3283 Columns: 47
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (2): State, Area name
dbl (25): FIPS Code, 2003 Rural-urban Continuum Code, 2003 Urban Influenc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop_data <- read_csv("../data/PopulationEstimates.csv", locale = locale(encoding = "Latin1"))
Warning: One or more parsing issues, see `problems()` for details
Rows: 3274 Columns: 149
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (3): FIPS, State, Area_Name
dbl (57): Rural-urban_Continuum Code_2003, Rural-urban_Continuum Code_201...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pov_data <- read_csv("../data/PovertyEstimates.csv", locale = locale(encoding = "Latin1"))
Rows: 3193 Columns: 34
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (2): Stabr, Area_name
dbl (17): FIPStxt, Rural-urban_Continuum_Code_2003, Urban_Influence_Code_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unm_data <- read_csv("../data/Unemployment.csv", locale = locale(encoding = "Latin1"))
Rows: 3275 Columns: 56
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (3): State, Area_name, Median_Household_Income_2018
dbl (17): FIPS, Rural_urban_continuum_code_2013, Urban_influence_code_201...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Taking a look at the education data
head(ed_data)
And the ed_data dimensions
dim(ed_data)
[1] 3283 47
Now looking at population data - it had a problem when reading in, we’ll have to interrogate that more later
head(pop_data)
pop_data dimensions
dim(pop_data)
[1] 3274 149
And now poverty data
head(pov_data)
poverty dimensions
dim(pov_data)
[1] 3193 34
And finally, unemployment data
head(unm_data)
and unemployment data dimensions
dim(unm_data)
[1] 3275 56
A quick look at the summaries of each
#summary(ed_data)
#summary(pop_data)
#summary(pov_data)
#summary(unm_data)
So it turns out that while summary() in R returns much the same information as .describe() does in pandas, it does it in a really ungainly format. I don’t think I’ll be doing that again. I’ve commented out the summaries to avoid clogging the notebook. At this point in pandas I listed out the data types in each dataframe, but here those have already been displayed on the head(). I also loaded in a shape file with the geometries of the counties, but this was ultimately only used for plotting and the maps package in R already covers US county mapping, so I won’t be using the shape file for this.
First, a tiny bit of housekeeping - making all the FIPS columns called just “FIPS”.
ed_data <- rename(ed_data, FIPS = "FIPS Code")
pov_data <- rename(pov_data, FIPS = FIPStxt)
Before joining, need to check that FIPS are actually unique
dim(distinct(select(ed_data, FIPS)))[1] == dim(ed_data)[1]
[1] TRUE
dim(distinct(select(pop_data, FIPS)))[1] == dim(pop_data)[1]
[1] TRUE
dim(distinct(select(pov_data, FIPS)))[1] == dim(pov_data)[1]
[1] TRUE
dim(distinct(select(unm_data, FIPS)))[1] == dim(unm_data)[1]
[1] TRUE
All good on that front. Of the four dataframes, three read in FIPS as dbl while pop_data read it in as chr. While the latter is more technically correct, I’ll convert pop_data FIPS to dbl so all four are on the same page when it comes to leading zeros and data types.
(pop_data <- mutate(pop_data, across(FIPS, as.double)))
And now to join them all on FIPS. And yes, the name of big_data is tongue-in-cheek.
big_data <- ed_data %>%
full_join(pop_data, by = c("FIPS")) %>%
full_join(pov_data, by = c("FIPS")) %>%
full_join(unm_data, by = c("FIPS"))
Let’s take a look at the result
head(big_data)
dim(big_data)
[1] 3284 283
and now let’s see how many nulls there are
sort(colSums(is.na(big_data)))
FIPS
1
State.x
1
Area name
1
State
9
Area_name.y
9
Less than a high school diploma, 2014-18
11
High school diploma only, 2014-18
11
Some college or associate's degree, 2014-18
11
Bachelor's degree or higher, 2014-18
11
Percent of adults with less than a high school diploma, 2014-18
11
Percent of adults with a high school diploma only, 2014-18
11
Percent of adults completing some college or associate's degree, 2014-18
11
Percent of adults with a bachelor's degree or higher, 2014-18
11
State.y
11
Area_Name
11
CENSUS_2010_POP
11
ESTIMATES_BASE_2010
11
POP_ESTIMATE_2010
11
POP_ESTIMATE_2011
11
POP_ESTIMATE_2012
11
POP_ESTIMATE_2013
11
POP_ESTIMATE_2014
11
POP_ESTIMATE_2015
11
POP_ESTIMATE_2016
11
POP_ESTIMATE_2017
11
POP_ESTIMATE_2018
11
Less than a high school diploma, 2000
12
High school diploma only, 2000
12
Some college or associate's degree, 2000
12
Bachelor's degree or higher, 2000
12
Percent of adults with less than a high school diploma, 2000
12
Percent of adults with a high school diploma only, 2000
12
Percent of adults completing some college or associate's degree, 2000
12
Percent of adults with a bachelor's degree or higher, 2000
12
Civilian_labor_force_2010
12
Employed_2010
12
Unemployed_2010
12
Unemployment_rate_2010
12
Civilian_labor_force_2011
12
Employed_2011
12
Unemployed_2011
12
Unemployment_rate_2011
12
Civilian_labor_force_2012
12
Employed_2012
12
Unemployed_2012
12
Unemployment_rate_2012
12
Civilian_labor_force_2013
12
Employed_2013
12
Unemployed_2013
12
Unemployment_rate_2013
12
Civilian_labor_force_2014
12
Employed_2014
12
Unemployed_2014
12
Unemployment_rate_2014
12
Civilian_labor_force_2015
12
Employed_2015
12
Unemployed_2015
12
Unemployment_rate_2015
12
Civilian_labor_force_2016
12
Employed_2016
12
Unemployed_2016
12
Unemployment_rate_2016
12
Civilian_labor_force_2017
12
Employed_2017
12
Unemployed_2017
12
Unemployment_rate_2017
12
Civilian_labor_force_2018
12
Employed_2018
12
Unemployed_2018
12
Unemployment_rate_2018
12
Less than a high school diploma, 1990
13
High school diploma only, 1990
13
Some college or associate's degree, 1990
13
Bachelor's degree or higher, 1990
13
Percent of adults with less than a high school diploma, 1990
13
Percent of adults with a high school diploma only, 1990
13
Percent of adults with a bachelor's degree or higher, 1990
13
Percent of adults completing some college or associate's degree, 1990
14
Civilian_labor_force_2007
14
Employed_2007
14
Unemployed_2007
14
Unemployment_rate_2007
14
Civilian_labor_force_2008
14
Employed_2008
14
Unemployed_2008
14
Unemployment_rate_2008
14
Civilian_labor_force_2009
14
Employed_2009
14
Unemployed_2009
14
Unemployment_rate_2009
14
Less than a high school diploma, 1980
17
High school diploma only, 1980
17
Some college (1-3 years), 1980
17
Four years of college or higher, 1980
17
Percent of adults with less than a high school diploma, 1980
17
Percent of adults with a high school diploma only, 1980
17
Percent of adults completing some college (1-3 years), 1980
17
Percent of adults completing four years of college or higher, 1980
17
Metro_2013
62
2003 Rural-urban Continuum Code
63
2003 Urban Influence Code
63
2013 Rural-urban Continuum Code
63
2013 Urban Influence Code
63
Rural-urban_Continuum Code_2013
64
Urban_Influence_Code_2013.x
64
Rural_urban_continuum_code_2013
65
Urban_influence_code_2013
65
Rural-urban_Continuum Code_2003
69
Urban_Influence_Code_2003.x
69
N_POP_CHG_2010
90
N_POP_CHG_2011
90
N_POP_CHG_2012
90
N_POP_CHG_2013
90
N_POP_CHG_2014
90
N_POP_CHG_2015
90
N_POP_CHG_2016
90
N_POP_CHG_2017
90
N_POP_CHG_2018
90
Births_2010
90
Births_2011
90
Births_2012
90
Births_2013
90
Births_2014
90
Births_2015
90
Births_2016
90
Births_2017
90
Births_2018
90
Deaths_2010
90
Deaths_2011
90
Deaths_2012
90
Deaths_2013
90
Deaths_2014
90
Deaths_2015
90
Deaths_2016
90
Deaths_2017
90
Deaths_2018
90
NATURAL_INC_2010
90
NATURAL_INC_2011
90
NATURAL_INC_2012
90
NATURAL_INC_2013
90
NATURAL_INC_2014
90
NATURAL_INC_2015
90
NATURAL_INC_2016
90
NATURAL_INC_2017
90
NATURAL_INC_2018
90
INTERNATIONAL_MIG_2010
90
INTERNATIONAL_MIG_2011
90
INTERNATIONAL_MIG_2012
90
INTERNATIONAL_MIG_2013
90
INTERNATIONAL_MIG_2014
90
INTERNATIONAL_MIG_2015
90
INTERNATIONAL_MIG_2016
90
INTERNATIONAL_MIG_2017
90
INTERNATIONAL_MIG_2018
90
DOMESTIC_MIG_2010
90
DOMESTIC_MIG_2011
90
DOMESTIC_MIG_2012
90
DOMESTIC_MIG_2013
90
DOMESTIC_MIG_2014
90
DOMESTIC_MIG_2015
90
DOMESTIC_MIG_2016
90
DOMESTIC_MIG_2017
90
DOMESTIC_MIG_2018
90
NET_MIG_2010
90
NET_MIG_2011
90
NET_MIG_2012
90
NET_MIG_2013
90
NET_MIG_2014
90
NET_MIG_2015
90
NET_MIG_2016
90
NET_MIG_2017
90
NET_MIG_2018
90
RESIDUAL_2010
90
RESIDUAL_2011
90
RESIDUAL_2012
90
RESIDUAL_2014
90
RESIDUAL_2015
90
RESIDUAL_2016
90
RESIDUAL_2017
90
RESIDUAL_2018
90
GQ_ESTIMATES_BASE_2010
90
GQ_ESTIMATES_2010
90
GQ_ESTIMATES_2011
90
GQ_ESTIMATES_2012
90
GQ_ESTIMATES_2013
90
GQ_ESTIMATES_2014
90
GQ_ESTIMATES_2015
90
GQ_ESTIMATES_2016
90
GQ_ESTIMATES_2017
90
GQ_ESTIMATES_2018
90
R_birth_2011
91
R_birth_2012
91
R_birth_2013
91
R_birth_2014
91
R_birth_2015
91
R_birth_2016
91
R_birth_2017
91
R_birth_2018
91
R_death_2011
91
R_death_2012
91
R_death_2013
91
R_death_2014
91
R_death_2015
91
R_death_2016
91
R_death_2017
91
R_death_2018
91
R_NATURAL_INC_2011
91
R_NATURAL_INC_2012
91
R_NATURAL_INC_2013
91
R_NATURAL_INC_2014
91
R_NATURAL_INC_2015
91
R_NATURAL_INC_2016
91
R_NATURAL_INC_2017
91
R_NATURAL_INC_2018
91
R_INTERNATIONAL_MIG_2011
91
R_INTERNATIONAL_MIG_2012
91
R_INTERNATIONAL_MIG_2013
91
R_INTERNATIONAL_MIG_2014
91
R_INTERNATIONAL_MIG_2015
91
R_INTERNATIONAL_MIG_2016
91
R_INTERNATIONAL_MIG_2017
91
R_INTERNATIONAL_MIG_2018
91
R_DOMESTIC_MIG_2011
91
R_DOMESTIC_MIG_2012
91
R_DOMESTIC_MIG_2013
91
R_DOMESTIC_MIG_2014
91
R_DOMESTIC_MIG_2015
91
R_DOMESTIC_MIG_2016
91
R_DOMESTIC_MIG_2017
91
R_DOMESTIC_MIG_2018
91
R_NET_MIG_2011
91
R_NET_MIG_2012
91
R_NET_MIG_2013
91
R_NET_MIG_2014
91
R_NET_MIG_2015
91
R_NET_MIG_2016
91
R_NET_MIG_2017
91
R_NET_MIG_2018
91
Stabr
91
Area_name.x
91
POVALL_2018
91
CI90LBAll_2018
91
CI90UBALL_2018
91
PCTPOVALL_2018
91
CI90LBALLP_2018
91
CI90UBALLP_2018
91
POV017_2018
91
CI90LB017_2018
91
CI90UB017_2018
91
PCTPOV017_2018
91
CI90LB017P_2018
91
CI90UB017P_2018
91
POV517_2018
91
CI90LB517_2018
91
CI90UB517_2018
91
PCTPOV517_2018
91
CI90LB517P_2018
91
CI90UB517P_2018
91
MEDHHINC_2018
91
CI90LBINC_2018
91
CI90UBINC_2018
91
Median_Household_Income_2018
91
RESIDUAL_2013
92
Med_HH_Income_Percent_of_State_Total_2018
92
Less than a high school diploma, 1970
98
High school diploma only, 1970
98
Some college (1-3 years), 1970
98
Four years of college or higher, 1970
98
Percent of adults with less than a high school diploma, 1970
98
Percent of adults with a high school diploma only, 1970
98
Percent of adults completing some college (1-3 years), 1970
98
Percent of adults completing four years of college or higher, 1970
98
Economic_typology_2015
142
Rural-urban_Continuum_Code_2013
143
Urban_Influence_Code_2013.y
143
Rural-urban_Continuum_Code_2003
148
Urban_Influence_Code_2003.y
148
POV04_2018
3232
CI90LB04_2018
3232
CI90UB04_2018
3232
PCTPOV04_2018
3232
CI90LB04P_2018
3232
CI90UB04P_2018
3232
All columns have at least one null (I suspect this is due to the parsing problem from pop_data), a large number have <20, another significant group have 50-100, and a handful have 140-150 or outright majority nulls. On closer inspection, the outright majority null columns are state-level data which is not necessary for my analysis. Many of the other nulls may be the result of state-levels rows. Before anything else, I’ll save a copy of the full data
write_csv(big_data, "../data/outer_join_R.csv")
And now to clean it up. First to take out the mostly null columns
(big_data <- select(big_data, -c(POV04_2018, CI90LB04_2018, CI90UB04_2018, PCTPOV04_2018, CI90LB04P_2018, CI90UB04P_2018)))
Now the number of rows with nulls is much more reasonable
sum(!complete.cases(big_data))
[1] 159
Let’s see if my suspicion there’s a fully null row is correct
filter(big_data, apply(is.na(big_data), 1, all))
Getting rid of the fully null row
big_data <- filter(big_data, !apply(is.na(big_data), 1, all))
Now let’s look at the remaining rows with at least one null
filter(big_data, apply(is.na(big_data), 1, any))
That’s mostly Puerto Rico and Alaska, as well as the summary rows for states and the entire country. Here’s the rows with at least one null which DON’T meet any of the criteria I just listed.
filter(big_data, apply(is.na(big_data), 1, any) & !(State.x %in% c("AK", "PR")) & (FIPS %% 1000 != 0))
At this point in the original pandas notebook I broke to review external research to better understand what was going on with these counties (and county equivalents). I decided to narrow down to just columns I thought I might actually use before cleaning further, as the time component complicated cleaning some counties pretty severely.
(big_slice <- select(big_data, FIPS, State.x, "Area name", '2013 Rural-urban Continuum Code', '2013 Urban Influence Code', "Percent of adults with less than a high school diploma, 2014-18", 'Percent of adults with a high school diploma only, 2014-18', "Percent of adults completing some college or associate's degree, 2014-18", "Percent of adults with a bachelor's degree or higher, 2014-18", 'Economic_typology_2015', 'POP_ESTIMATE_2010', 'POP_ESTIMATE_2018', 'R_NATURAL_INC_2018', 'R_NET_MIG_2018', 'PCTPOVALL_2018', 'Unemployment_rate_2010', 'Unemployment_rate_2018', 'Median_Household_Income_2018'))
NA
Now I’ll get rid of Alaska, Hawaii, Puerto Rico, and the summary rows
(big_slice <- filter(big_slice, !(State.x %in% c("AK", "PR", "HI")) & (FIPS %% 1000 != 0)))
Now let’s see what nulls remain
filter(big_slice, apply(is.na(big_slice), 1, any))
Based on external research, I think I can drop all three of these without losing much
(big_slice <- drop_na(big_slice))
Now to finish the cleaning. I’ll make FIPS an int in R instead of a chr for mapping reasons which will come up later
(big_slice <- mutate(big_slice, across(FIPS, as.integer)))
The Rural-urban codes should be ints as well
(big_slice <- mutate(big_slice, across("2013 Rural-urban Continuum Code":"2013 Urban Influence Code", as.integer)))
Economic typology should really be a chr
(big_slice <- mutate(big_slice, across(Economic_typology_2015, as.character)))
Population estimates are currently doubles but should be integers
(big_slice <- mutate(big_slice, across(POP_ESTIMATE_2010:POP_ESTIMATE_2018, as.integer)))
And last but not least for type changes, median household income is currently a chr with dollar signs and thousand separators, neither of which we want
big_slice$Median_Household_Income_2018 <- str_replace_all(big_slice$Median_Household_Income_2018, "[[:punct:]/$]", "")
(big_slice <- mutate(big_slice, across(Median_Household_Income_2018, as.integer)))
Much better. Now I can create some new variables
(big_slice <- mutate(big_slice,
Unemployment_change = Unemployment_rate_2018 - Unemployment_rate_2010,
Population_percent_change = (POP_ESTIMATE_2018 - POP_ESTIMATE_2010) / POP_ESTIMATE_2010
))
And also one-hot-encode the Economic Typology
big_slice %>%
mutate(ones = 1) %>%
pivot_wider(
names_from = Economic_typology_2015,
values_from = ones,
names_prefix = "Econ_Typology_",
names_sort = TRUE,
values_fill = 0,
values_fn = as.integer)
Saving a copy of the modeling-ready data
write_csv(big_slice, "../data/processed_R.csv")
Let’s specify the X and y to use for modeling - more features may be used later
(X <- select(big_slice, 6, 13, 14, 17, 18, 19, 20))
y <- big_slice$PCTPOVALL_2018
Train test split
train_len <- as.integer(0.75 * dim(X)[1])
train_test_split <- sample(c(rep(0, train_len), rep(1, dim(X)[1] - train_len)))
X_train <- X[train_test_split == 0, ]
X_test <- X[train_test_split == 1, ]
y_train <- y[train_test_split == 0]
y_test <- y[train_test_split == 1]
Scaling the data
X_train_sc <- scale(X_train)
Can we get that scale back?
train_scale <- attributes(X_train_sc)$`scaled:scale`
train_centers <- attributes(X_train_sc)$`scaled:center`
And now to apply that to X_test
X_test_sc <- as_tibble(scale(X_test, center = train_centers, scale = train_scale))
X_sc <- as_tibble(scale(X, center = train_centers, scale = train_scale))
X_train_sc <- as_tibble(X_train_sc)
Now a linear model
linreg <- lm(y_train ~ Unemployment_rate_2018 + `Percent of adults with less than a high school diploma, 2014-18` + R_NATURAL_INC_2018 + R_NET_MIG_2018 + Median_Household_Income_2018 + Unemployment_change + Population_percent_change, data = X_train_sc)
Looking at the model summary - note that the R2 isn’t great, but not bad at all
summary(linreg)
Call:
lm(formula = y_train ~ Unemployment_rate_2018 + `Percent of adults with less than a high school diploma, 2014-18` +
R_NATURAL_INC_2018 + R_NET_MIG_2018 + Median_Household_Income_2018 +
Unemployment_change + Population_percent_change, data = X_train_sc)
Residuals:
Min 1Q Median 3Q Max
-16.0391 -1.8902 -0.4046 1.3001 24.0312
Coefficients:
Estimate
(Intercept) 15.19936
Unemployment_rate_2018 1.26757
`Percent of adults with less than a high school diploma, 2014-18` 1.69539
R_NATURAL_INC_2018 0.28640
R_NET_MIG_2018 -0.60838
Median_Household_Income_2018 -3.43256
Unemployment_change -0.11910
Population_percent_change 0.43550
Std. Error
(Intercept) 0.06619
Unemployment_rate_2018 0.07794
`Percent of adults with less than a high school diploma, 2014-18` 0.08450
R_NATURAL_INC_2018 0.08785
R_NET_MIG_2018 0.09590
Median_Household_Income_2018 0.09375
Unemployment_change 0.07581
Population_percent_change 0.11117
t value
(Intercept) 229.619
Unemployment_rate_2018 16.263
`Percent of adults with less than a high school diploma, 2014-18` 20.063
R_NATURAL_INC_2018 3.260
R_NET_MIG_2018 -6.344
Median_Household_Income_2018 -36.613
Unemployment_change -1.571
Population_percent_change 3.917
Pr(>|t|)
(Intercept) < 2e-16
Unemployment_rate_2018 < 2e-16
`Percent of adults with less than a high school diploma, 2014-18` < 2e-16
R_NATURAL_INC_2018 0.00113
R_NET_MIG_2018 2.68e-10
Median_Household_Income_2018 < 2e-16
Unemployment_change 0.11630
Population_percent_change 9.21e-05
(Intercept) ***
Unemployment_rate_2018 ***
`Percent of adults with less than a high school diploma, 2014-18` ***
R_NATURAL_INC_2018 **
R_NET_MIG_2018 ***
Median_Household_Income_2018 ***
Unemployment_change
Population_percent_change ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.196 on 2323 degrees of freedom
Multiple R-squared: 0.7271, Adjusted R-squared: 0.7262
F-statistic: 884 on 7 and 2323 DF, p-value: < 2.2e-16
And the sorted coefs. NB: I didn’t sort by absolute value here, so they aren’t in order of magnitude of effect.
sort(coef(linreg))
Median_Household_Income_2018
-3.4325582
R_NET_MIG_2018
-0.6083822
Unemployment_change
-0.1191027
R_NATURAL_INC_2018
0.2864007
Population_percent_change
0.4354999
Unemployment_rate_2018
1.2675679
`Percent of adults with less than a high school diploma, 2014-18`
1.6953941
(Intercept)
15.1993565
How does it perform on the test set? Manually calculating Adj. R2 on test set
1 - (sum((y_test - add_predictions(X_test_sc, linreg)$pred)^2) / (dim(X_test_sc)[1] - dim(X_test_sc)[2] - 1)) / (sum((y_test - mean(y_test))^2) / (dim(X_test_sc)[1] - 1))
[1] 0.7005477
Now to get predictions on the entire scaled dataset (for mapping)
X_sc <- add_predictions(X_sc, linreg)
Mapping the error
palette(rainbow(50, start = 0, end = 0.3, rev = TRUE))
offset <- abs(min(X_sc$pred))
regions <- inner_join(county.fips, mutate(big_slice, pred = X_sc$pred), by = c("fips" = "FIPS"))
map('county', region = regions$polyname, exact=TRUE, fill = TRUE, col = (regions$pred + offset), lwd = 0.2)
FIPS code 46102 refers to Oglala Lakota County, which doesn’t appear in this fips directory due to a name change from Shannon County in 2014 which also changed the FIPS code.
filter(county.fips, polyname == "south dakota,shannon" | fips == 46102)
Plotting the predictions against actual to visually evaluate the model. I’d love to use a diverging palette centered on 0 here, but I’m moving on for the sake of time.
ggplot(data = X_sc) +
geom_point(mapping = aes(x = y, y = pred, color = (y - pred)))
Now let’s look at a model accounting for interactions
linreg2 <- lm(y_train ~ Unemployment_rate_2018 * `Percent of adults with less than a high school diploma, 2014-18` * R_NATURAL_INC_2018 * R_NET_MIG_2018 * Median_Household_Income_2018 * Unemployment_change * Population_percent_change, data = X_train_sc)
#summary(linreg2)
So that summary ended up being kind of huge, but hey! Significantly higher R^2 values and still a ridiculously small p-value. Good stuff. Summary commented out for cleanliness. Mapping the error for this one - still need to get better at working with palettes to have this be on the same scale as the previous map.
regions2 <- inner_join(county.fips, mutate(big_slice, pred = add_predictions(X_sc, linreg2)$pred), by = c("fips" = "FIPS"))
offset2 <- abs(min(regions$pred))
map('county', region = regions2$polyname, exact=TRUE, fill = TRUE, col = (regions$pred + offset2), lwd = 0.2)
And the test R^2 for linreg2
1 - (sum((y_test - add_predictions(X_test_sc, linreg2)$pred)^2) / (dim(X_test_sc)[1] - dim(X_test_sc)[2] - 1)) / (sum((y_test - mean(y_test))^2) / (dim(X_test_sc)[1] - 1))
[1] 0.7716487
Testing R^2 has improved as well, though not by as much, which is to be expected. Plotting the predicted against actual again for the interaction mode. Same caveat with the color palette as the similar plot above.
ggplot(data = add_predictions(X_sc, linreg2)) +
geom_point(mapping = aes(x = y, y = pred, color = (y - pred)))
And plotting error against actual
ggplot(data = add_predictions(X_sc, linreg2)) +
geom_point(alpha = 0.2, mapping = aes(x = y, y = (y - pred))) +
geom_ref_line(h = 0, colour = "red")
The errors visibly curve upwards at higher values. Investigating some of the worst errors. Still getting the hang of pipes.
big_slice %>% mutate( ErrorPoly = add_predictions(X_sc, linreg2)$pred - y) %>% arrange(desc(ErrorPoly))
Worth noting that if I did this correctly (not an insignificant “if”) my highest errors are quite different than the original Python ones. Strange and worth looking into.
Now the main event - Kmeans! Using the Lloyd algorithm to match with the version I created in sklearn.
(clustering <- kmeans(select(X_sc, !pred), centers = 8, iter.max = 1000, nstart = 100, algorithm = "Lloyd"))
K-means clustering with 8 clusters of sizes 540, 197, 555, 479, 194, 691, 200, 252
Cluster means:
Percent of adults with less than a high school diploma, 2014-18
1 0.50434023
2 1.55167945
3 0.02795914
4 -0.74836368
5 -0.32018264
6 -0.39873646
7 1.43060433
8 -0.93149162
R_NATURAL_INC_2018 R_NET_MIG_2018 Unemployment_rate_2018
1 -0.6263823 0.3595697 0.3104808
2 -0.2309517 -0.8658475 2.1137316
3 -0.6755405 -0.4238099 0.5229862
4 0.0676130 -0.6335739 -0.8609722
5 0.7051512 1.8549927 -0.4078850
6 0.1316757 0.3582573 -0.2998739
7 1.5197683 -0.7205129 -0.1399658
8 0.8441239 0.2856057 -0.6806658
Median_Household_Income_2018 Unemployment_change Population_percent_change
1 -0.6308310 -1.2286659 -0.21441392
2 -1.1651410 -0.5631025 -0.84830246
3 -0.5067658 0.2504223 -0.64959653
4 0.1106185 1.1713470 -0.42662892
5 0.8471009 -0.1438088 2.32917032
6 0.2096499 -0.0819266 0.26733050
7 -0.3345929 0.5519030 0.08145902
8 2.1740270 0.3825758 0.87786462
Clustering vector:
[1] 6 5 2 1 1 2 1 3 1 1 1 1 2 1 1 6 1 1 1 3 1 1 3 2 1 6 1 1 1 7 1 2 1 1 6
[36] 1 6 1 3 1 5 5 2 2 6 1 1 6 3 1 6 1 2 1 3 1 1 6 8 2 1 1 6 1 2 2 1 2 3 6
[71] 1 6 6 1 5 1 2 6 5 2 6 2 1 3 1 5 6 1 6 6 2 3 3 3 3 3 3 6 6 3 3 3 3 3 6
[106] 3 1 6 6 6 7 6 3 6 3 2 3 7 3 3 2 3 3 3 6 7 1 4 2 3 3 3 3 3 3 2 3 3 3 6
[141] 3 4 1 2 5 3 3 7 7 1 3 3 1 5 3 3 7 8 3 1 1 1 2 8 1 8 2 2 6 2 6 2 2 1 1
[176] 7 2 8 1 6 2 2 6 7 8 6 8 8 1 5 6 5 7 8 8 1 6 8 8 8 8 1 1 1 8 8 1 2 1 1
[211] 2 1 8 6 1 5 6 8 5 4 3 8 8 5 4 8 3 1 3 5 1 5 1 8 8 8 5 1 8 8 8 5 6 1 6
[246] 8 6 4 6 8 5 1 6 4 6 5 6 6 6 7 3 8 5 4 8 7 6 4 3 8 5 5 8 4 8 6 4 5 4 8
[281] 6 6 8 6 6 8 6 6 6 5 8 6 6 6 1 6 6 1 5 1 5 5 6 1 1 6 6 5 3 1 1 1 1 1 7
[316] 7 1 1 5 1 5 3 1 1 5 5 6 1 6 1 5 1 6 6 4 5 5 1 5 5 6 5 6 5 1 5 5 5 5 6
[351] 5 1 1 3 1 6 5 1 1 1 7 2 1 6 5 6 2 1 1 1 1 3 5 6 2 1 3 6 1 6 6 1 6 7 1
[386] 5 6 2 6 7 8 1 7 5 1 5 1 1 3 5 1 6 1 1 1 6 1 7 5 1 1 7 1 8 1 5 1 8 1 3
[421] 6 1 1 1 8 1 5 2 1 8 1 1 5 6 1 5 1 1 2 1 1 6 1 7 1 6 7 1 5 6 6 1 1 2 6
[456] 2 1 3 2 6 1 6 2 6 6 5 1 5 1 6 6 6 1 3 1 2 1 2 7 6 1 1 1 1 1 1 2 3 1 7
[491] 2 2 2 3 1 2 1 2 1 1 2 5 1 1 5 1 1 1 1 2 2 6 7 1 1 1 1 5 1 6 6 1 6 8 5
[526] 5 8 6 4 6 5 6 7 7 1 6 6 8 6 6 7 1 5 7 5 6 6 3 7 7 7 4 6 7 6 7 1 8 6 5
[561] 6 4 2 3 6 4 3 3 3 3 6 3 3 3 6 3 6 3 6 6 3 7 8 3 3 6 3 3 3 3 3 3 8 3 3
[596] 2 3 3 3 3 6 3 3 3 2 8 3 8 3 8 3 3 3 3 3 3 8 4 3 3 6 3 3 3 3 4 3 8 3 3
[631] 6 6 3 3 4 3 3 2 6 3 3 3 3 3 4 3 3 6 3 3 3 3 3 6 4 4 3 3 3 8 3 1 4 7 6
[666] 6 6 1 8 6 6 1 6 6 6 1 7 6 6 6 1 6 6 1 6 1 6 6 4 1 3 8 8 6 8 1 1 6 6 6
[701] 6 6 1 8 4 6 7 3 1 1 1 6 6 6 1 6 6 6 1 1 1 1 1 6 6 6 6 6 3 6 1 6 1 6 1
[736] 6 6 1 6 3 6 6 6 3 6 1 6 1 6 8 1 1 6 6 6 4 4 6 6 4 4 4 4 8 4 7 4 4 4 4
[771] 4 4 4 4 6 4 4 4 7 5 7 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 4 4 4 4 6 6
[806] 6 8 4 4 4 3 8 4 4 4 8 4 6 4 4 4 4 4 4 6 4 4 4 4 4 4 8 4 4 4 4 6 4 8 4
[841] 4 4 4 4 6 8 4 4 4 4 4 6 4 4 3 4 3 4 4 3 4 6 4 3 3 4 4 4 4 4 4 4 6 4 4
[876] 6 6 4 3 4 4 7 7 6 7 4 3 7 7 4 3 7 3 4 7 4 4 4 4 8 7 4 4 3 4 8 4 3 4 4
[911] 4 4 4 4 8 4 3 6 4 4 3 4 4 6 4 4 4 4 8 4 4 4 4 4 4 4 4 4 4 4 6 7 4 4 4
[946] 4 4 4 7 4 4 4 4 4 4 4 3 3 7 1 1 6 3 1 2 2 8 6 3 6 3 2 1 6 1 3 6 6 3 1
[981] 2 1 7 6 2 2 3 1 6 1 2 1 6 1 2 6 2 1 1 6
[ reached getOption("max.print") -- omitted 2108 entries ]
Within cluster sum of squares by cluster:
[1] 1576.3545 1099.2593 1408.4598 1144.5249 1262.7983 1552.7751 1084.8146
[8] 893.7738
(between_SS / total_SS = 53.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Checking the attributes
attributes(clustering)
$names
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
$class
[1] "kmeans"
attributes(clustering)$centers
NULL
I am still confused by attributes in R…